Data used - MapZen Weekly OpenStreetMap Metro Extracts
Data Cleaning Source Code: ipython notebook, github
My Nanodegree Portal: fch808.github.io
All Data Wrangling Exercises: ipython notebook
Map Areas: Honolulu, Hawaii and Albuquerque, New Mexico.
Street names were the first cleaning target: abbreviated street types and directions were expanded to their full forms using a mapping of common abbreviations.
In [ ]:
# Print each street-name correction that update_name() will apply.
for st_type, ways in abq_st_types.iteritems():
    for name in ways:
        better_name = update_name(name, mapping)
        if name != better_name:
            print name, "=>", better_name
In [ ]:
Honolulu:
Kalakaua Ave => Kalakaua Avenue
Lusitania St. => Lusitania Street
...
Albuquerque:
Valley View Dr NW => Valley View Drive Northwest
Eubank Northeast Ste E-18 => Eubank Northeast Suite E-18
...
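For reference, a minimal sketch of what an update_name() helper of this kind might look like. The regex and the abbreviation mapping shown here are illustrative assumptions rather than the exact code used, and the real cleaning pass also expands abbreviations appearing mid-name (e.g. both "Dr" and "NW" in "Valley View Dr NW"), not only the final token:

import re

# Match the last word of a street name, e.g. "St." in "Lusitania St."
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# Illustrative mapping; the real one covers many more abbreviations.
mapping = {"St.": "Street", "Ave": "Avenue", "Dr": "Drive"}

def update_name(name, mapping):
    """Expand an abbreviated street type into its full form."""
    match = street_type_re.search(name)
    if match and match.group() in mapping:
        name = name[:match.start()] + mapping[match.group()]
    return name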
Initially, some cleanup was needed for the zip codes. This included reducing entries to the plain 5-digit format, e.g. stripping ZIP+4 suffixes and any surrounding text.
After adding a zip code cleaning function, all zip codes adhered to a 5-digit code, but a few (such as 89197 and 87100) with only one entry each may have been entered incorrectly, since they do not match known zip codes for these cities.
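A minimal sketch of such a cleaning step (the helper name and regex are illustrative assumptions, not the exact code used):

import re

def update_zipcode(zipcode):
    """Reduce a postcode entry to its 5-digit base, if one is present."""
    match = re.search(r'\d{5}', zipcode)
    return match.group() if match else zipcode

# e.g. update_zipcode("96815-2518") -> "96815"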
In [ ]:
def map_aggregate(db, collection, pipeline):
    db_collection = db[collection]
    result = db_collection.aggregate(pipeline)
    return result

# This function will be reused in further MongoDB explorations.
# Only the pipeline will need to be updated.
def make_city_pipeline(city):
    pipeline = [{"$match":{"address.postcode":{"$exists":1},
                           "city_name":city}},
                {"$group": {"_id": {"City":"$city_name",
                                    "Zip":"$address.postcode"},
                            "count": {"$sum": 1}}},
                {"$project": {'_id':0,
                              "City":"$_id.City",
                              "Zipcode":"$_id.Zip",
                              "Count":"$count"}},
                {"$sort": {"Count": -1}},
                {"$limit" : 10}]
    return pipeline

pipeline = make_city_pipeline('honolulu')
result1 = map_aggregate(db, 'cities', pipeline)
pipeline = make_city_pipeline('albuquerque')
result2 = map_aggregate(db, 'cities', pipeline)
In [ ]:
u'result': [{u'City': u'honolulu', u'Count': 219, u'Zipcode': 96815
...}
u'result': [{u'City': u'albuquerque', u'Count': 476, u'Zipcode': 87123
...}]}
Two cities were chosen for this project: Albuquerque, my birth city, and Honolulu, where I have lived for the last 7 years. It would be interesting to compare the distributions present in each city, but handling two cities posed a small problem: the initial MongoDB loading script had to be modified to accommodate multiple cities.
The solution was to add a 'city_name' field, passed in with the JSON data when loading, so that this key-value pair could easily be added to each record. We can also make use of projections to clean up what's returned, since we have to match cities first and then group by multiple conditions.
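A minimal sketch of that load step (load_city() is a hypothetical name; the actual loading script may differ):

import json
from pymongo import MongoClient

def load_city(db, json_file, city_name):
    """Insert records from a JSON-lines file, tagging each with its city."""
    with open(json_file) as f:
        for line in f:
            record = json.loads(line)
            record['city_name'] = city_name  # tag added at load time
            db.cities.insert_one(record)     # .insert() in older pymongo

db = MongoClient().examples  # assumed database name
load_city(db, 'honolulu_hawaii.osm.json', 'honolulu')
load_city(db, 'albuquerque_new-mexico.osm.json', 'albuquerque')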
First, we can look at some statistics for our dataset via pymongo queries against MongoDB.
In [ ]:
albuquerque_new-mexico.osm - 53.8 MB
albuquerque_new-mexico.osm.json - 65.7 MB
honolulu_hawaii.osm - 43.8 MB
honolulu_hawaii.osm.json - 54.4 MB
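(For reference, these sizes can be checked with a quick os.path.getsize() loop over the files listed above:)

import os

for f in ['albuquerque_new-mexico.osm', 'albuquerque_new-mexico.osm.json',
          'honolulu_hawaii.osm', 'honolulu_hawaii.osm.json']:
    print "%s - %.1f MB" % (f, os.path.getsize(f) / 1.0e6)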
In [6]:
print "Both Cities:", db.cities.find().count()
print "Honolulu:", db.cities.find({'city_name':'honolulu'}).count()
print "Albuquerque:", db.cities.find({'city_name':'albuquerque'}).count()
In [1]:
Both Cities: 489274
Honolulu: 227752
Albuquerque: 261522
In [ ]:
print "Both Cities:", db.cities.find({'type':'node'}).count()
print "Honolulu:", db.cities.find({'type':'node', 'city_name':'honolulu'}).count()
print "Albuquerque:", db.cities.find({'type':'node','city_name':'albuquerque'}).count()
In [1]:
Both Cities: 432587
Honolulu: 206262
Albuquerque: 226325
In [1]:
print "Both Cities:", db.cities.find({'type':'way'}).count()
print "Honolulu:", db.cities.find({'type':'way', 'city_name':'honolulu'}).count()
print "Albuquerque:", db.cities.find({'type':'way', 'city_name':'albuquerque'}).count()
In [1]:
Both Cities: 56648
Honolulu: 21459
Albuquerque: 35189
In [1]:
print "Contributors:", len(db.cities.distinct("created.user"))
In [1]:
Contributors: 611
Since we are reusing the make_city_pipeline() function defined earlier for the zip codes, only the pipeline passed in to each new pymongo query is shown below; everything else remains the same.
Parking is by far the most reported amenity in either city. It seems that finding places to park our automobiles is most important, even on small islands.
In [ ]:
pipeline = [{"$match":{"amenity":{"$exists":1}, "city_name":city}},
{"$group": {"_id": {"City":"$city_name", "Amenity":"$amenity"},
"count": {"$sum": 1}}},
{"$project": {'_id':0,"City":"$_id.City",
"Amenity":"$_id.Amenity", "Count":"$count"}},
{"$sort": {"Count": -1}},
{"$limit" : 5 }]
In [ ]:
u'result': [{u'Amenity': u'parking', u'City': u'honolulu', u'Count': 280},
{u'Amenity': u'restaurant', u'City': u'honolulu', u'Count': 123},..}
u'result': [{u'Amenity': u'parking', u'City': u'albuquerque',u'Count': 1270},
{u'Amenity': u'school', u'City': u'albuquerque', u'Count': 258},..}
The top contributor for each city made up the lion's share of the edits. Interestingly, woodpeck_fixbot showed up in both cities, taking the #3 spot in Honolulu and the #5 spot in Albuquerque.
In [ ]:
pipeline = [{"$match":{"created.user":{"$exists":1},
"city_name":city}},
{"$group": {"_id": {"City":"$city_name", "User":"$created.user"},
"count": {"$sum": 1}}},
{"$project": {'_id':0, "City":"$_id.City",
"User":"$_id.User", "Count":"$count"}},
{"$sort": {"Count": -1}},
{"$limit" : 5 }]
In [ ]:
u'result': [{u'City': u'honolulu', u'Count': 98401, u'User': u'Tom_Holland'},
{u'City': u'honolulu', u'Count': 13051, u'User': u'ikiya'},...}
u'result': [{u'City': u'albuquerque', u'Count': 88041, u'User': u'EdHillsman'},
{u'City': u'albuquerque', u'Count': 37604, u'User': u'anjbe'},...}
Religion is more prominent and more one-sided in the Southwest. Albuquerque is heavily Christian, with plentiful churches. Honolulu reports far fewer places of worship, and Buddhism accounts for a sizable share of those entered.
In [ ]:
pipeline = [{"$match":{"amenity":{"$exists":1},
"amenity":"place_of_worship", "city_name":city}},
{"$group":{"_id": {"City":"$city_name", "Religion":"$religion"},
"count":{"$sum":1}}},
{"$project":{"_id":0, "City":"$_id.City",
"Religion":"$_id.Religion", "Count":"$count"}},
{"$sort":{"Count":-1}},
{"$limit":6}]
In [ ]:
u'result': [{u'City': u'honolulu', u'Count': 14, u'Religion': u'christian'},
{u'City': u'honolulu', u'Count': 5, u'Religion': u'buddhist'},..}
u'result': [{u'City': u'albuquerque', u'Count': 186,u'Religion': u'christian'},
{u'City': u'albuquerque', u'Count': 36, u'Religion': 'NoNameGiven'},..}
For restaurants, pizza is most popular in the isles; regional Mexican food leads in the Southwest.
In [ ]:
pipeline = [{"$match":{"amenity":{"$exists":1},
"amenity":"restaurant", "city_name":city}},
{"$group":{"_id":{"City":"$city_name","Food":"$cuisine"},
"count":{"$sum":1}}},
{"$project":{"_id":0, "City":"$_id.City",
"Food":"$_id.Food", "Count":"$count"}},
{"$sort":{"Count":-1}},
{"$limit":6}]
In [ ]:
u'result': [{u'City': u'honolulu', u'Count': 7, u'Food': u'pizza'},
{u'City': u'honolulu', u'Count': 3, u'Food': u'regional'},
{u'City': u'honolulu', u'Count': 3, u'Food': u'japanese'}...}
u'result': [{u'City': u'albuquerque', u'Count': 16, u'Food': u'mexican'},
{u'City': u'albuquerque', u'Count': 10, u'Food': u'pizza'},
{u'City': u'albuquerque', u'Count': 6, u'Food': u'american'}...}
These far-apart cities show similar fast-food entrenchment until the third choice. Next, let's see who these chains are.
In [ ]:
pipeline = [{"$match":{"amenity":{"$exists":1},
"amenity":"fast_food", "city_name":city}},
{"$group":{"_id":{"City":"$city_name", "Food":"$cuisine"},
"count":{"$sum":1}}},
{"$project":{"_id":0, "City":"$_id.City",
"Food":"$_id.Food",
"Count":"$count"}},
{"$sort":{"Count":-1}},
{"$limit":6}]
In [ ]:
u'result': [{u'City': u'honolulu', u'Count': 13, u'Food': u'burger'},
{u'City': u'honolulu', u'Count': 2, u'Food': u'sandwich'},
{u'City': u'honolulu', u'Count': 2, u'Food': u'sushi'},...}
{u'ok': 1.0,
u'result': [{u'City': u'albuquerque', u'Count': 31, u'Food': u'burger'},
{u'City': u'albuquerque', u'Count': 16, u'Food': u'sandwich'},
{u'City': u'albuquerque', u'Count': 6, u'Food': u'pizza'},...}
The ubiquitous McDonald's and Subway have a far reach.
In [ ]:
pipeline = [{"$match":{"amenity":{"$exists":1},
"amenity":"fast_food","city_name":city}},
{"$group":{"_id":{"City":"$city_name",
"Name":"$name"},
"count":{"$sum":1}}},
{"$project":{"_id":0,
"City":"$_id.City",
"Name":"$_id.Name",
"Count":"$count"}},
{"$sort":{"Count":-1}},
{"$limit":6}]
In [ ]:
u'result': [{u'City': u'honolulu', u'Count': 8, u'Name': u"McDonald's"},
{u'City': u'honolulu', u'Count': 4, u'Name': u'Subway'},
{u'City': u'honolulu', u'Count': 3, u'Name': u'Burger King'},..}
u'result': [{u'City': u'albuquerque', u'Count': 23, u'Name': u'Subway'},
{u'City': u'albuquerque', u'Count': 12, u'Name': u"Blake's Lotaburger"},
{u'City': u'albuquerque', u'Count': 11, u'Name': u"McDonald's"},...}
Auto repair shops are more popular in the expansive Southwest. This makes sense, since many more miles are put on automobiles in the much larger state of New Mexico.
In [ ]:
pipeline = [{"$match":{"shop":{"$exists":1}, "city_name":city}},
{"$group":{"_id":{"City":"$city_name", "Shop":"$shop"},
"count":{"$sum":1}}},
{"$project": {'_id':0, "City":"$_id.City",
"Shop":"$_id.Shop", "Count":"$count"}},
{"$sort":{"Count":-1}},
{"$limit":10}]
In [ ]:
u'result': [{u'City': u'honolulu', u'Count': 50, u'Shop': u'supermarket'},
{u'City': u'honolulu', u'Count': 24, u'Shop': u'convenience'},
{u'City': u'honolulu', u'Count': 18, u'Shop': u'clothes'},...}
u'result': [{u'City': u'albuquerque', u'Count': 66, u'Shop': u'convenience'},
{u'City': u'albuquerque', u'Count': 46, u'Shop': u'supermarket'},
{u'City': u'albuquerque', u'Count': 22, u'Shop': u'car_repair'},...}
The most popular convenience stores are regional.
In [ ]:
pipeline = [{"$match":{"shop":{"$exists":1},
"city_name":city, "shop":"convenience"}},
{"$group":{"_id":{"City":"$city_name", "Name":"$name"},
"count":{"$sum":1}}},
{"$project": {'_id':0, "City":"$_id.City",
"Name":"$_id.Name", "Count":"$count"}},
{"$sort":{"Count":-1}},
{"$limit":5}]
In [ ]:
u'result': [{u'City': u'honolulu', u'Count': 7, u'Name': u'ABC Store'},...}
u'result': [{u'City': u'albuquerque', u'Count': 14, u'Name': u'Circle K'}...}
When checking the output of the data as it was being cleaned, I found that a lot of the earlier data was already fairly clean. This makes sense: since these are weekly extracts, older data has had time to be cleaned by the community, while newer additions have not.
It was also amazing to me how easily multiple cities could be aggregated, and how quickly new cleaning functions could be added into different sub-stages of cleaning.
It doesn't seem that it would be too hard to flesh out more robust cleaning algorithms, and run a script semi-regularly to clean the data automatically.
Also, I did not get a chance to use MongoDB's geospatial querying, but there is a fair amount of lat/lon data that could benefit from this type of querying.
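As a rough sketch of what that could look like (assuming each document stored GeoJSON-ordered coordinates in a 'pos' field; OSM-style [lat, lon] pairs would need to be flipped to [lon, lat] first):

from pymongo import MongoClient, GEOSPHERE

db = MongoClient().examples  # assumed database name

# Assumes 'pos' holds GeoJSON [lon, lat] coordinates.
db.cities.create_index([("pos", GEOSPHERE)])

# Find nodes within 1 km of downtown Honolulu.
nearby = db.cities.find({"pos": {"$near": {
    "$geometry": {"type": "Point", "coordinates": [-157.8583, 21.3069]},
    "$maxDistance": 1000}}})  # meters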
These two cities are fairly small compared to other, more active cities in OpenStreetMap, so API pulls from other databases, such as Google Maps, might better populate points of interest in the less active areas.
In [19]:
from IPython import utils
from IPython.core.display import HTML
import os

def css_styling():
    """Load default custom.css file from ipython profile"""
    base = utils.path.get_ipython_dir()
    styles = "<style>\n%s\n</style>" % (open(os.path.join(base, 'profile_custom1/static/custom/custom.css'), 'r').read())
    return HTML(styles)

css_styling()
Out[19]: